41 research outputs found

Scalable Task-Based Algorithm for Multiplication of Block-Rank-Sparse Matrices

Author: Baruch E.
Cannon L. E.
Choi J.
Solomonik E.
Szabo A.
van de Geijn R. A.
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 09/10/2015

A task-based formulation of the Scalable Universal Matrix Multiplication Algorithm (SUMMA), a popular algorithm for matrix multiplication (MM), is applied to the multiplication of hierarchy-free, rank-structured matrices that appear in the domain of quantum chemistry (QC). The novel features of our formulation are: (1) concurrent scheduling of multiple SUMMA iterations, and (2) fine-grained task-based composition. These features make it tolerant of the load imbalance due to the irregular matrix structure and eliminate all artifactual sources of global synchronization. Scalability of the iterative computation of the square-root inverse of block-rank-sparse QC matrices is demonstrated; for full-rank (dense) matrices the performance of our SUMMA formulation usually exceeds that of the state-of-the-art dense MM implementations (ScaLAPACK and Cyclops Tensor Framework). Comment: 8 pages, 6 figures, accepted to IA3 2015. arXiv admin note: text overlap with arXiv:1504.0504
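For readers unfamiliar with SUMMA, the basic iteration structure named in the abstract can be sketched serially as follows. This is a minimal illustration of textbook SUMMA on a square process grid, not the task-based formulation of the paper: the function name `summa_serial` and the parameters `grid` and `panel` are hypothetical, the "broadcasts" are simulated by plain indexing, and nothing here models concurrent scheduling of iterations or block-rank-sparse structure.

```python
def summa_serial(A, B, grid, panel):
    """Serially simulate SUMMA for square n x n matrices (lists of lists).

    grid  -- hypothetical process grid is grid x grid; each process owns
             one (n/grid) x (n/grid) block of C.
    panel -- number of columns of A / rows of B handled per iteration.
    """
    n = len(A)
    if n % grid != 0:
        raise ValueError("n must be divisible by grid")
    blk = n // grid
    C = [[0] * n for _ in range(n)]

    # One SUMMA iteration per panel of the inner (k) dimension.
    for k0 in range(0, n, panel):
        k1 = min(k0 + panel, n)
        # In a distributed run, the column panel A[:, k0:k1] would be
        # broadcast along process rows and the row panel B[k0:k1, :]
        # along process columns. Here each simulated process simply
        # reads the slices it would have received.
        for pi in range(grid):
            for pj in range(grid):
                # Local rank-(k1 - k0) update of the block owned by (pi, pj).
                for i in range(pi * blk, (pi + 1) * blk):
                    for j in range(pj * blk, (pj + 1) * blk):
                        acc = 0
                        for k in range(k0, k1):
                            acc += A[i][k] * B[k][j]
                        C[i][j] += acc
    return C
```

In this serial form the iterations over `k0` run strictly one after another; the abstract's first novel feature amounts to letting several of those iterations be in flight at once.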

arXiv.org e-Print Archive

Crossref

Flexible Communication Avoiding Matrix Multiplication on FPGA with High-Level Synthesis

Author: Aggarwal A.
Anderson E.
Anderson M.
D'Alberto P.
Del M.
Demmel J.
Jia-Wei H.
Kumar V. B. Y.
Lavin C.
Lin C. Y.
Moss D. J.
Solomonik E.
Vanhoucke V.
Wu E.
Zhou H.
Zhuo L.
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 01/02/2020

Data movement is the dominating factor affecting performance and energy in modern computing systems. Consequently, many algorithms have been developed to minimize the number of I/O operations for common computing patterns. Matrix multiplication is no exception, and lower bounds have been proven and implemented both for shared and distributed memory systems. Reconfigurable hardware platforms are a lucrative target for I/O minimizing algorithms, as they offer full control of memory accesses to the programmer. While bounds developed in the context of fixed architectures still apply to these platforms, the spatially distributed nature of their computational and memory resources requires a decentralized approach to optimize algorithms for maximum hardware utilization. We present a model to optimize matrix multiplication for FPGA platforms, simultaneously targeting maximum performance and minimum off-chip data movement, within constraints set by the hardware. We map the model to a concrete architecture using a high-level synthesis tool, maintaining a high level of abstraction, which allows us to support arbitrary data types and enables maintainability and portability across FPGA devices. Kernels generated from our architecture are shown to offer competitive performance in practice, scaling with both compute and memory resources. We offer our design as an open source project to encourage the open development of linear algebra and I/O minimizing algorithms on reconfigurable hardware platforms.
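The trade-off between on-chip buffer size and off-chip traffic mentioned in the abstract can be illustrated with a small counting experiment. The sketch below is not the paper's model or architecture: it assumes a simple scheme in which one square output tile is held "on chip" for the whole inner loop, and the function name `tiled_matmul` and the `loads` counter are hypothetical. It only shows that, under this assumption, the number of operand words fetched shrinks as the tile grows (2n³/tile when the tile size divides n).

```python
def tiled_matmul(A, B, tile):
    """Multiply square n x n matrices with an output-stationary tiling.

    Returns (C, loads), where loads counts the operand words of A and B
    that would be fetched from off-chip memory if only the current
    tile x tile block of C were kept on chip (an assumed, simplified
    memory model).
    """
    n = len(A)
    C = [[0] * n for _ in range(n)]
    loads = 0
    for i0 in range(0, n, tile):
        i1 = min(i0 + tile, n)
        for j0 in range(0, n, tile):
            j1 = min(j0 + tile, n)
            # The C tile [i0:i1, j0:j1] stays resident for the whole k loop.
            for k in range(n):
                a_col = [A[i][k] for i in range(i0, i1)]  # fetched from A
                b_row = [B[k][j] for j in range(j0, j1)]  # fetched from B
                loads += len(a_col) + len(b_row)
                # Outer-product update of the resident C tile.
                for di, a in enumerate(a_col):
                    for dj, b in enumerate(b_row):
                        C[i0 + di][j0 + dj] += a * b
    return C, loads
```

Doubling `tile` halves the counted traffic in this model, at the cost of a four times larger resident tile, which is the kind of constraint a hardware-aware model has to balance.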

arXiv.org e-Print Archive

Repository for Publications and Research Data

Crossref

Equilibrium gas-phase structures of sodium fluoride, bromide, and iodide monomers and dimers

Author: Akishin P. A.
Bauer S. H.
Berkowitz J.
Biermann S.
Blake A. J.
Brain P. T.
Brumer P.
Caris M.
Cederberg J.
Chauhan R. S.
David W. H. Rankin
Derek A. Wann
Dickey R. P.
Friedman L.
Frisch M. J.
Frishknecht A. L.
Groen C. P.
Hamilton W. C.
Hartley J. G.
Head-Gordon M.
Hedberg L.
Hilderbrandt R. L.
Hilpert K.
Hinchley S. L.
Iron M. A.
Jan M. L. Martin
Krishnan R.
Kuchitsu K.
Langhoff S. R.
Lintuluoto M.
Malliavin M.-J.
Martin J. M. L.
Martin T. P.
Mawhorter R. J.
McCaffrey P. D.
McLean A. D.
Milne T. A.
Mitzel N. W.
Modisette J.
O’Konski C. T.
Peterson K. A.
Philip D. McCaffrey
Richard J. Mawhorter
Sipachev V. A.
Solomonik V. G.
Sæbø S.
Timp B. A.
Törring T.
Weis P.
Welch D. O.
Wetzel T. L.
Woon D. E.
Publication venue: 'American Chemical Society (ACS)'
Publication date: 04/03/2014

The alkali halides sodium fluoride, sodium bromide, and sodium iodide exist in the gas phase as both monomer and dimer species. A reanalysis of gas electron diffraction (GED) data collected earlier has been undertaken for each of these molecules using the EXPRESS method to yield experimental equilibrium structures. EXPRESS allows amplitudes of vibration to be estimated and correction terms to be applied to each pair of atoms in the refinement model. These quantities are calculated from the ab initio potential-energy surfaces corresponding to the vibrational modes of the monomer and dimer. Because they include many of the effects associated with large-amplitude modes of vibration and anharmonicity, we have been able to determine highly accurate experimental structures. These results are found to be in good agreement with those from high-level core-valence ab initio calculations and are substantially more precise than those obtained in previous structural studies.

Crossref

White Rose Research Online

Parcelles de terre chersonésiennes au début du IIIe s. av.n.è.

Author: Gaudey Jacqueline
Nikolaenko G. M.
Solomonik E. I.
Publication venue: PERSÉE : Université de Lyon, CNRS & ENS de Lyon
Publication date: 01/01/1995

Solomonik E. I., Nikolaenko G. M., Gaudey Jacqueline. Parcelles de terre chersonésiennes au début du IIIe s. av.n.è. In: Esclavage et dépendance dans l'historiographie soviétique récente. Besançon : Université de Franche-Comté, 1995. pp. 185-210. (Annales littéraires de l'Université de Besançon, 577)

A communication-optimal N-body algorithm for direct interactions

Author: Driscoll M
Georganas E
Koanantakool P
Solomonik E
Yelick K
Publication venue: eScholarship, University of California
Publication date: 07/10/2013

We consider the problem of communication avoidance in computing interactions between a set of particles in scenarios with and without a cutoff radius for interaction. Our strategy, which we show to be optimal in communication, divides the work in the iteration space rather than simply dividing the particles over processors, so more than one processor may be responsible for computing updates to a single particle. Similar to a force decomposition in molecular dynamics, this approach requires up to √p times more memory than a particle decomposition, but reduces communication costs by factors up to √p and is often faster in practice than a particle decomposition [1]. We examine a generalized force decomposition algorithm that tolerates the memory-limited case, i.e. when memory can only hold c copies of the particles for c = 1, 2, ..., √p. When c = 1, the algorithm degenerates into a particle decomposition; similarly, when c = √p, the algorithm uses a force decomposition. We present a proof that the algorithm is communication-optimal and reduces critical path latency and bandwidth costs by factors of c² and c, respectively. Performance results from experiments on up to 24K cores of Cray XE-6 and 32K cores of IBM Blue Gene/P machines indicate that the algorithm reduces communication in practice. In some cases, it even outperforms the original force decomposition approach because the right choice of c strikes a balance between the costs of collective and point-to-point communication. Finally, we extend the analysis to include a cutoff radius for direct evaluation of force interactions. We show that with a cutoff, communication optimality still holds. We sketch a generalized algorithm for multi-dimensional space and assess its performance for 1D and 2D simulations on the same systems. © 2013 IEEE
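The idea of splitting the iteration space with a replication factor c can be made concrete with a small scheduling sketch. The code below is an illustration under stated assumptions, not the authors' algorithm as published: it assumes p processors grouped into p/c teams of c replicas, one particle block per team, c² dividing p, and a simple cyclic assignment of source blocks to replicas. The function name `replicated_schedule` and the assignment pattern are hypothetical.

```python
def replicated_schedule(p, c):
    """Assign source-block work to p processors with replication factor c.

    Assumed layout: p // c teams, each owning one block of target
    particles that is replicated on the team's c processors. Every
    replica handles p // c**2 of the source blocks, so the team as a
    whole covers all p // c source blocks exactly once.

    Returns a dict mapping (team, replica) to the list of source blocks
    that processor interacts with its team's target block.
    """
    if c < 1 or p % (c * c) != 0:
        raise ValueError("this sketch requires c >= 1 and c*c dividing p")
    teams = p // c          # number of particle blocks
    steps = teams // c      # source blocks per processor, p / c^2
    schedule = {}
    for team in range(teams):
        for replica in range(c):
            start = team + replica * steps
            schedule[(team, replica)] = [
                (start + s) % teams for s in range(steps)
            ]
    return schedule
```

With c = 1 each processor visits all p source blocks, which is the particle decomposition; with c = √p each processor handles a single pair of blocks, as in a force decomposition. The per-processor step count of p/c² in this sketch mirrors the c² latency factor stated in the abstract.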

eScholarship - University of California